Effective Dimension Reduction Techniques for Text Documents

نویسندگان

  • P. Ponmuthuramalingam
  • T. Devi
چکیده

Frequent term based text clustering is a text clustering technique, which uses frequent term set and dramatically decreases the dimensionality of the document vector space, thus especially addressing: very high dimensionality of the data and very large size of the databases. Frequent Term based Clustering algorithm (FTC) has shown significant efficiency comparing to some well known text clustering methods, but the quality of clustering still needs further enhancement. In this paper, the morphological variant words, stopwords and grammatical words are identified and removed for further dimension reduction. Two effective dimension reduction algorithms, improved stemming and frequent term generation algorithms have been presented. An experiment on classical text documents as well as on web documents demonstrates that the developed algorithms yield good dimension reduction.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Dimension Reduction Methods of Text Documents by Neural Networks

The paper is oriented to introduce different dimension reduction methods in the text document retrieval area. First, the mostly used text document retrieval models are described, and then in second part the analytical approach and neural network approaches to dimension reduction of keyword space are described. Dimension reduction methods reduce keyword space into much smaller size together with...

متن کامل

ارائه مدلی برای استخراج اطلاعات از مستندات متنی، مبتنی بر متن‌کاوی در حوزه یادگیری الکترونیکی

As computer networks become the backbones of science and economy, enormous quantities documents become available. So, for extracting useful information from textual data, text mining techniques have been used. Text Mining has become an important research area that discoveries unknown information, facts or new hypotheses by automatically extracting information from different written documents. T...

متن کامل

Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond

The bag-of-words representation commonly used in text analysis can be analyzed very efficiently and retains a great deal of useful information, but it is also troublesome because the same thought can be expressed using many different terms or one term can have very different meanings. Dimension reduction can collapse together terms that have the same semantics, to identify and disambiguate term...

متن کامل

Automated Text Categorization Using Support Vector Machine

In this paper, we study the use of support vector machine in text categorization. Unlike other machine learning techniques , it allows easy incorporation of new documents into an existing trained system. Moreover, dimension reduction, which is usually imperative, now becomes optional. Thus, SVM adapts eeciently in dynamic environments that require frequent additions to the document collection. ...

متن کامل

Temporal Text Mining: A Thematic Exploration of Don Quixote

Temporal text mining (TTM) is the discovery of temporal patterns in documents that are collected over time. It involves discovery of latent themes, construction of a thematic evolution graph, and analysis of thematic patterns. This paper uses text mining and time series analysis techniques to explore Don Quixote de la Mancha, a two-volume master work of Western literature. First, it uses singul...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010